Gradual rollout #110
base: main
Conversation
Force-pushed from 401b378 to 6c79808.
I sadly can't comment on the lines directly because of GitHub:
https://github.com/telekom/das-schiff-network-operator/pull/110/files#diff-a96964d7107a0881e826cd2f6ac71e901d612df880362eb04cb0b54eb609b8e5L70-L90
This does not seem required anymore when the Layer2 Network Configurations are handled by the ConfigReconciler.
Also, what happens when a node joins(!) or leaves the cluster? Especially when the joining or leaving happens in the middle of a rollout?
Force-pushed from f73e182 to 7dd91cf.
If a node joins the cluster, it should be configured in the next reconciliation loop iteration (I think, I will check that to be sure). But on node leave, the config will be tagged as invalid (as it should time out) and the configuration will be aborted. I'll try to fix that.
It would be nice to watch node events in the central manager to create that nodeconfig before the next reconcile loop.
@chdxD1
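A minimal sketch of the node-watch idea above, assuming a recent controller-runtime version; the NodeConfigReconciler type, controller name, and request mapping are illustrative, not the operator's actual code:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// NodeConfigReconciler stands in for the central config reconciler;
// the real type in the operator may look different.
type NodeConfigReconciler struct {
	client.Client
}

func (r *NodeConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Here the reconciler would create or update the NodeConfig for req.Name.
	return ctrl.Result{}, nil
}

// SetupWithManager wires the reconciler so that Node add/update/delete events
// enqueue a reconcile request immediately, instead of waiting for the next
// periodic loop iteration.
func (r *NodeConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		Named("nodeconfig").
		Watches(&corev1.Node{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				// one request per node; the reconciler decides what to do with it
				return []reconcile.Request{{NamespacedName: client.ObjectKey{Name: obj.GetName()}}}
			})).
		Complete(r)
}
```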
Force-pushed from a192158 to 9cbcd4f.
Force-pushed from 9cbcd4f to dc5ac0f.
Force-pushed from 31cd606 to 6c8cec3.
Force-pushed from 8d37361 to 945b6d3.
Good work overall. I'd like to change a few things though, especially with regards to the gradual rollout mechanism.
Force-pushed from 17c6f79 to bb8c516.
Found a few more things.
One thing I dislike with the current implementation is the lack of clear responsibilities between the controllers. Most importantly, only one controller should be responsible for setting the status of a resource. This is not always the case at the moment. It's perfectly fine if the status takes a while to be set after a resource is created, and I'd prefer that over guessing the status when creating the resource.
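As a small illustration of the single-writer principle for status (the helper name and wiring are hypothetical, not from this PR), the controller that owns a resource's lifecycle would be the only place that writes its status subresource:

```go
package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setStatus is a hypothetical helper: the status of a resource is written in
// exactly one controller, the one that owns that resource's lifecycle; every
// other controller treats the status as read-only.
func setStatus(ctx context.Context, c client.Client, obj client.Object, mutate func()) error {
	mutate() // apply the status change on the already-fetched object
	return c.Status().Update(ctx, obj)
}
```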
Force-pushed from 564d522 to 7db1953.
Way cleaner now, good job.
A few more minor things :)
Force-pushed from 7db1953 to 8cbcd3e.
lgtm now, thanks!
I'll leave it for @chdxD1 for final review
Force-pushed from 8cbcd3e to 3ea2c54.
I am fine with the overall flow and how it evolved now.
I rolled it out in one of our dev clusters and it worked, after kicking it a bit.
For some reason (not logged!) the config revision went into an invalid state.
I would therefore ask to add some logging, which would ease debugging in a lot of cases: when marking a config as invalid, when creating new revisions, when creating new nodeconfigs, etc.
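For illustration, a hypothetical helper in the style of the requested logging (names and fields are assumptions, using the logr API that controller-runtime already exposes):

```go
package reconciler

import "github.com/go-logr/logr"

// markRevisionInvalid shows the kind of logging asked for above: every state
// transition (invalidating a config, creating a new revision, creating a
// nodeconfig) logs what happened and why, with enough context to debug a rollout.
func markRevisionInvalid(log logr.Logger, revision, node, reason string) {
	log.Info("marking config revision as invalid",
		"revision", revision,
		"node", node,
		"reason", reason)
}
```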
if err := crr.deployConfig(ctx, newConfig, currentConfig, node); err != nil {
	if errors.Is(err, InvalidConfigError) || errors.Is(err, context.DeadlineExceeded) {
a) I think InvalidConfigError is not used anywhere.
b) If context.DeadlineExceeded happens (e.g. the operator cannot communicate with the K8s API), I would prefer a retry instead of marking the revision as invalid.
a) Deleted InvalidConfigError.
b) I've explicitly added 3 retries. If it can't connect within 3 tries, a reconciler error will be reported.
Also, all actions like creation/update/deletion of objects etc. should now be logged.
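A rough sketch of what such a bounded retry could look like; the function name, retry count constant, and backoff are illustrative, not the actual implementation in this PR:

```go
package reconciler

import (
	"context"
	"fmt"
	"time"
)

// deployWithRetries retries a deploy call a fixed number of times so that a
// transient API error (e.g. a context deadline exceeded) does not immediately
// invalidate the revision; after the last attempt the reconciler reports an error.
func deployWithRetries(ctx context.Context, deploy func(context.Context) error) error {
	const maxRetries = 3
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if lastErr = deploy(ctx); lastErr == nil {
			return nil
		}
		// brief pause between attempts, but stop early if the context is cancelled
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
	return fmt.Errorf("deploy failed after %d attempts: %w", maxRetries, lastErr)
}
```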
Force-pushed from 3ea2c54 to 29fb032.
if wasConfigTimeoutReached(&configs[i]) {
	// If the timeout was reached, the revision is invalid (but still counts as ongoing).
	invalid++
}
Okay, I am struggling with this a bit.
Imagine the following scenario:
- A new node joins the cluster (or the network operator is updated, etc.)
- network-operator is not running for various reasons for 2 minutes

I am thinking of a better solution to find out if a rollout has reached a timeout.
Some things to consider:
- If a node cannot roll out config at all because of other issues, the node should be marked as NotReady by the kubelet anyway; our automation will take care of removing that node altogether (deleting the nodeconfig anyway)
- A rollout can only time out once it was started

What would you think of an applyStarted field in status which is written before a nodeconfig is rolled out (i.e. an agent has received the config but has not yet configured it)? This would also be used to update the lastUpdated timestamp, and if applyStarted is true and lastUpdated has reached the timeout of 2 minutes, the config is marked as failed.
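A sketch of the proposed check, using a trimmed-down stand-in for the status fields (the real CRD fields and the provisioning status constant may differ):

```go
package reconciler

import "time"

// nodeConfigStatus is an illustrative, trimmed-down view of the status fields
// discussed above; the real CRD uses its own types.
type nodeConfigStatus struct {
	ConfigStatus string    // "", "provisioning", "provisioned", "invalid"
	LastUpdate   time.Time // written by the agent running on the node
}

const rolloutTimeout = 2 * time.Minute

// rolloutTimedOut encodes the proposal: a rollout can only time out once it
// has actually started (the agent picked the config up and set it to
// provisioning), and only if the last status update is older than the timeout.
func rolloutTimedOut(status nodeConfigStatus, now time.Time) bool {
	if status.ConfigStatus != "provisioning" {
		return false
	}
	if status.LastUpdate.IsZero() {
		// the agent never reported anything; don't treat this as a timeout
		return false
	}
	return now.Sub(status.LastUpdate) > rolloutTimeout
}
```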
Actually we already have provisioning, maybe just use that instead of applyStarted.
But wait a second, that's already the case, let me look further...
What if, before invalidating the revision, we checked whether the node that 'timed out' is still available (e.g. it was not deleted and is not in NotReady state)? If it was deleted or is NotReady, we could then simply not use its config's state to determine whether the revision should be invalidated.
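A sketch of that idea, using the standard corev1 Node conditions; the helper name and how it would be wired into the invalidation logic are assumptions:

```go
package reconciler

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeStillRelevant returns false for a node that has been deleted or has gone
// NotReady, so its timed-out config would not count against the revision;
// cluster automation is expected to remove such a node anyway.
func nodeStillRelevant(node *corev1.Node) bool {
	if node == nil {
		// the Node object no longer exists in the cluster
		return false
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```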
* A rollout can only time out once it was started

Currently that's true, as only the agent that operates on the node updates the LastUpdate timestamp in the config's status, so if there's any value there, it means that the node started to reconfigure itself.
I see one issue: the ConfigStatus is by default an empty string "" and the lastUpdate is 0 (aka 1990...). If we check for "" as well in line 164 (case StatusProvisioning, "":) and then check for the timeout, it will always be true, I guess?
ConfigStatus == "" should be a different case with a higher timeout, first of all, plus a check that lastUpdate is not 0.
OK, I've added a longer timeout for configs with empty status, and added a check against lastUpdate.
Also, I made the timeout value configurable using flags.
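A sketch of what flag-configurable timeouts with a separate, longer timeout for the empty-status case could look like; the flag names and defaults are assumptions, not the actual ones added in this PR:

```go
package configurator

import (
	"flag"
	"time"
)

var (
	provisioningTimeout = flag.Duration("config-timeout", 2*time.Minute,
		"time after which a provisioning NodeConfig is considered failed")
	emptyStatusTimeout = flag.Duration("preconfig-timeout", 10*time.Minute,
		"longer timeout applied while the config status is still empty")
)

// timeoutFor picks the timeout for a given config status: an empty status
// (the agent has not reported anything yet) gets the longer, more forgiving
// timeout, as discussed above.
func timeoutFor(configStatus string) time.Duration {
	if configStatus == "" {
		return *emptyStatusTimeout
	}
	return *provisioningTimeout
}
```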
Force-pushed from c1c34e5 to 6ed1c53.
Force-pushed from e00cb48 to cc119f7.
Signed-off-by: Patryk Strusiewicz-Surmacki <[email protected]>
Force-pushed from cc119f7 to 574bdef.
This PR implements gradual rollout as described in #98.

There are 2 new CRDs added:
- NodeConfig - which represents the node configuration
- NodeConfigProcess - which represents the global state of the configuration (can be provisioning or provisioned). This is used to check whether the previous leader failed in the middle of the configuration process. If so, backups are restored.

A new pod is added - network-operator-configurator - this pod (daemonset) is responsible for fetching vrfrouteconfigurations, layer2networkconfigurations and routingtables and combining those into a NodeConfig for each node.

The network-operator-worker pod, instead of fetching separate config resources, will now only fetch NodeConfig. After configuration is done and connectivity is checked, it will back up the config on disk. If connectivity is lost after deploying a new config, the configuration will be restored using the local backup.

For each node there can be 3 NodeConfig objects created:
- <nodename> - current configuration
- <nodename>-backup - backup configuration
- <nodename>-invalid - last known invalid configuration

How does it work:
- network-operator-configurator starts and leader election takes place. The leader checks NodeConfigProcess to see if any config is in invalid or provisioning state, i.e. whether the previous leader died amid the configuration process. If so, it will revert the configuration for all the nodes using the backup configuration.
- On any change of a vrfrouteconfigurations, layer2networkconfigurations and/or routingtables object, the configurator will create a new NodeConfig for each node, set the NodeConfigProcess state to provisioning, and set the configs' status to provisioning.
- network-operator-worker fetches the new config and configures the node. It checks connectivity and sets the config status to provisioned on success, or to invalid on failure.
- On provisioned it proceeds with deploying the next node(s); on invalid it aborts the deployment and reverts changes on all the nodes that were changed in this iteration.

The configurator can be set to update more than 1 node concurrently. The number of nodes for concurrent update can be set using the update-limit configuration flag (defaults to 1).
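A sketch of the batched rollout described above, assuming a helper that deploys a single node's config; the names, the concurrency via errgroup, and the revert handling are illustrative, not the PR's actual code:

```go
package configurator

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// rolloutInBatches updates nodes in batches of at most updateLimit: the next
// batch starts only after every node in the current batch succeeded, and any
// failure aborts the rollout so the changed nodes can be reverted.
func rolloutInBatches(ctx context.Context, nodes []string, updateLimit int,
	deployNode func(ctx context.Context, node string) error) error {
	if updateLimit < 1 {
		updateLimit = 1
	}
	for start := 0; start < len(nodes); start += updateLimit {
		end := start + updateLimit
		if end > len(nodes) {
			end = len(nodes)
		}
		// deploy the current batch concurrently, at most updateLimit nodes at once
		g, gctx := errgroup.WithContext(ctx)
		for _, node := range nodes[start:end] {
			node := node
			g.Go(func() error { return deployNode(gctx, node) })
		}
		if err := g.Wait(); err != nil {
			// abort the rollout; reverting the already-changed nodes would happen here
			return fmt.Errorf("rollout aborted: %w", err)
		}
	}
	return nil
}
```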